Improved Strong Worst-case Upper Bounds for MDP Planning
Authors
Abstract
The Markov Decision Problem (MDP) plays a central role in AI as an abstraction of sequential decision making. We contribute to the theoretical analysis of MDP planning, which is the problem of computing an optimal policy for a given MDP. Specifically, we furnish improved strong worst-case upper bounds on the running time of MDP planning. Strong bounds are those that depend only on the number of states n and the number of actions k in the specified MDP; they have no dependence on affiliated variables such as the discount factor and the number of bits needed to represent the MDP. Worst-case bounds apply to every run of an algorithm; randomised algorithms can typically yield faster expected running times. While the special case of 2-action MDPs (that is, k = 2) has recently received some attention, bounds for general k have remained to be improved for several decades. Our contributions are to this general case. For k ≥ 3, the tightest strong upper bound shown to date for MDP planning belongs to a family of algorithms called Policy Iteration. This bound is only a polynomial improvement over a trivial bound of poly(n, k) · k^n [Mansour and Singh, 1999]. In this paper, we generalise a contrasting algorithm called the Fibonacci Seesaw, and derive a bound of poly(n, k) · k^(0.6834n). The key construct that we use is a template to map algorithms for the 2-action setting to the general setting. Interestingly, this idea can also be used to design Policy Iteration algorithms with a running time upper bound of poly(n, k) · k^(0.7207n). Both our results improve upon bounds that have stood for several decades.
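As a concrete illustration of what MDP planning and Policy Iteration refer to, here is a minimal sketch of Howard's Policy Iteration in Python; it is not the paper's algorithm, and the transition tensor T, reward matrix R, discount factor gamma, and the tiny random example are all illustrative assumptions. The strong bounds discussed above count iterations independently of the discount factor and bit precision; the discounted setting below is only to keep the evaluation step a single linear solve.

```python
import numpy as np

def policy_iteration(T, R, gamma=0.95):
    """Howard's Policy Iteration on a discounted MDP with n states and k actions.
    T[s, a, s'] is the transition probability, R[s, a] the expected reward."""
    n, k, _ = T.shape
    pi = np.zeros(n, dtype=int)               # arbitrary initial policy
    while True:
        # Policy evaluation: solve (I - gamma * T_pi) V = R_pi exactly.
        T_pi = T[np.arange(n), pi]            # n x n transitions under pi
        R_pi = R[np.arange(n), pi]            # n rewards under pi
        V = np.linalg.solve(np.eye(n) - gamma * T_pi, R_pi)
        # Policy improvement: switch every improvable state to a greedy action.
        Q = R + gamma * (T @ V)               # Q[s, a] = R[s, a] + gamma * E[V(s')]
        pi_next = Q.argmax(axis=1)
        if np.array_equal(pi_next, pi):       # no improvable state left: pi is optimal
            return pi, V
        pi = pi_next

# Tiny random MDP: n = 4 states, k = 3 actions (purely illustrative numbers).
rng = np.random.default_rng(0)
T = rng.random((4, 3, 4))
T /= T.sum(axis=2, keepdims=True)             # normalise each row into a distribution
R = rng.random((4, 3))
pi_star, V_star = policy_iteration(T, R)
print("optimal policy:", pi_star)
```

The running-time bounds in the abstract count how many improvement steps such a loop can take in the worst case; the trivial bound k^n comes from the fact that no policy is ever revisited.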
Similar resources
Computing and Using Lower and Upper Bounds for Action Elimination in MDP Planning
We describe a way to improve the performance of MDP planners by modifying them to use lower and upper bounds to eliminate non-optimal actions during their search. First, we discuss a particular state-abstraction formulation of MDP planning problems and how to use that formulation to compute bounds on the Q-functions of those planning problems. Then, we describe how to incorporate those bounds i...
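A minimal sketch of the action-elimination rule that the snippet above describes, under the assumption that we already hold elementwise lower bounds L[s, a] and upper bounds U[s, a] on the optimal Q-function; the array names and toy numbers are illustrative, not taken from that paper. An action is provably non-optimal at a state once its upper bound drops below the best lower bound available at that state.

```python
import numpy as np

def eliminate_actions(L, U):
    """Keep action a at state s only if U[s, a] >= max_a' L[s, a'],
    i.e. unless some other action is guaranteed to be at least as good."""
    best_lower = L.max(axis=1, keepdims=True)   # strongest guarantee per state
    return U >= best_lower                       # boolean mask of surviving actions

# Illustrative bounds for 2 states and 3 actions.
L = np.array([[1.0, 0.2, 0.5],
              [0.0, 0.9, 0.3]])
U = np.array([[1.5, 0.8, 1.2],
              [0.4, 1.1, 0.7]])
# At state 0, actions 0 and 2 survive; at state 1 only action 1 does.
print(eliminate_actions(L, U))
```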
Posterior sampling for reinforcement learning: worst-case regret bounds
We present an algorithm based on posterior sampling (aka Thompson sampling) that achieves near-optimal worst-case regret bounds when the underlying Markov Decision Process (MDP) is communicating with a finite, though unknown, diameter. Our main result is a high probability regret upper bound of Õ(D√SAT) for any communicating MDP with S states, A actions and diameter D, when T ≥ SA. Here, reg...
Optimistic posterior sampling for reinforcement learning: worst-case regret bounds
We present an algorithm based on posterior sampling (aka Thompson sampling) that achieves near-optimal worst-case regret bounds when the underlying Markov Decision Process (MDP) is communicating with a finite, though unknown, diameter. Our main result is a high probability regret upper bound of Õ(D√SAT) for any communicating MDP with S states, A actions and diameter D, when T ≥ SA. Here, reg...
Near-optimal Reinforcement Learning in Factored MDPs
Any learning algorithm over Markov decision processes (MDPs) will have worst-case regret Ω(√SAT), where T is the elapsed time and S and A are the cardinalities of the state and action spaces. In many settings of interest S and A may be so huge that it is impossible to guarantee good performance for an arbitrary MDP on any practical timeframe T. We show that, if we know the true system can be...
Upper and Lower Bounds on the Time Complexity of Infinite-Domain CSPs
The constraint satisfaction problem (CSP) is a widely studied problem with numerous applications in computer science. For infinite-domain CSPs, there are many results separating tractable and NP-hard cases, while upper bounds on the time complexity of hard cases are virtually unexplored. Hence, we initiate a study of the worst-case time complexity of such CSPs. We analyse backtracking algorithms ...
Journal title:
Volume / Issue:
Pages: -
Publication year: 2017